Abstract

Dense subgraphs of sparse graphs (communities), which appear in most real-world complex networks, play an
important role in many contexts. Computing them however is generally expensive. We propose here a measure 
of similarities between vertices based on random walks which has several important advantages: it captures well 
the community structure in a network, it can be computed efficiently, it works at various scales, and it can be 
used in an agglomerative algorithm to compute efficiently the community structure of a network. We propose such 
an algorithm which runs in time $O(mn^2)$ and space $O(n^2)$ in the worst case, and in time $O(n^2 \log n)$ and space
$O(n^2)$ in most real-world cases ($n$ and $m$ are respectively the number of vertices and edges in the input graph).
Experimental evaluation shows that our algorithm surpasses previously proposed ones concerning the quality of the 
obtained community structures and that it stands among the best ones concerning the running time. This is very 
promising because our algorithm can be improved in several ways, which we sketch at the end of the paper. 



1 Introduction 

Recent advances have brought out the importance of complex networks in many different domains such as 
sociology (acquaintance networks, collaboration networks), biology (metabolic networks, gene networks) or 
computer science (Internet topology, Web graph, P2P networks). We refer to |31l 1291 HI 1211 17j for reviews 
from different perspectives and for an exhaustive bibliography. The associated graphs are in general globally 
sparse but locally dense: there exist groups of vertices, called communities, highly connected between them 
but with few links to other vertices. This kind of structure brings out much information about the network. 
For example, in a metabolic network the communities correspond to biological functions of the cell [201 ■ In 
the Web graph the communities correspond to topics of interest ^3 ^] . 

This notion of community is however difficult to define formally. Many definitions have been proposed 
in social networks studies [2], but they are too restrictive or cannot be computed efficiently. However, 
most recent approaches have reached a consensus, and consider that a partition $\mathcal{P} = \{C_1, \ldots, C_k\}$ of the
vertices of a graph $G = (V, E)$ ($\forall i, C_i \subseteq V$) represents a good community structure if the proportion of
edges inside the $C_i$ (internal edges) is high compared to the proportion of edges between them (see for
example the definitions given in [13]). Therefore, we will design an algorithm which finds communities
satisfying this criterion. 

We will consider throughout this paper an undirected graph $G = (V, E)$ with $n = |V|$ vertices and
$m = |E|$ edges. We impose that each vertex is linked to itself by a loop (we add these loops if necessary).
We also suppose that G is connected, the case where it is not being treated by considering the components 
as different graphs. 

1.1 Our approach and results 

Our approach is based on the following intuition, already pointed out in 0]: small length random walks on 
a graph tend to get "trapped" into densely connected parts corresponding to communities. We therefore 
begin with a theoretical study of random walks on graphs. Using this, we define a measurement of the 
structural similarity between vertices and between communities, thus defining a distance. We relate this 
distance to existing spectral approaches of the problem. But our distance has an important advantage on 
these methods: it is efficiently computable, and can be used in a hierarchical clustering algorithm (merging
iteratively the vertices into communities). One obtains this way a hierarchical community structure that
may be represented as a tree structure called dendrogram (an example is provided in Figure 1). We propose
such an algorithm which finds a community structure in time $O(mnH)$ where $H$ is the height of the
corresponding dendrogram. The worst case is $O(mn^2)$. But most real-world complex networks are sparse
($m = O(n)$) and, as already noticed in [5], $H$ is generally small and tends to the most favourable case in
which the dendrogram is balanced ($H = O(\log n)$). In this case, the complexity is therefore $O(n^2 \log n)$. We
finally evaluate the performance of our algorithm with different experiments which show that it surpasses 
previously proposed algorithms. 

1.2 Related work 

There exist many algorithms to find community structure in graphs. Most of them result from very recent 
works, but this topic is related to the classical problem of graph partitioning that consists in splitting a 
graph into a given number of groups while minimizing the cost of the edge cut [1 II 1241 ITS] . However, these 
algorithms are not well suited to our case because they need the number of communities and their size 
as parameters. The recent interest in the domain has started with a new divisive approach proposed by 
Girvan and Newman |15l l'2Hj : the edges with the largest betweenness (number of shortest paths passing 
through an edge) are removed one by one in order to split hierarchically the graph into communities. This 
algorithm runs in time $O(m^2 n)$. Similar algorithms were proposed by Radicchi et al. [25] and by Fortunato
et al ^31- The first one uses a local quantity (the number of loops of a given length containing an edge) 
to choose the edges to remove and runs in time $O(m^2)$. The second one uses a more complex notion of
information centrality that gives better results but performs poorly, in time $O(m^3 n)$.

Hierarchical clustering is another classical approach introduced by sociologists for data analysis El ■ 
From a measurement of the similarity between vertices, an agglomerative algorithm groups iteratively the 
vertices into communities (there exist different methods differing on the way of choosing the communities 
to merge at each step) . We will use this approach in our algorithm and other agglomerative methods have 
also been recently introduced. Newman proposed in [22] a greedy algorithm that starts with $n$ communities
corresponding to the vertices and merges communities in order to optimize a function called modularity
which measures the quality of a partition. This algorithm runs in $O(mn)$ and has recently been improved
to a complexity $O(mH \log n)$ (with our notations) [5]. The algorithm of Donetti and Munoz [3] uses a
hierarchical clustering method: they use the eigenvectors of the Laplacian matrix of the graph to measure
the similarities between vertices. The complexity is determined by the computation of all the eigenvectors,
in $O(n^3)$ time for sparse matrices.

In the current situation, one can process graphs with up to a few hundred thousand vertices using
the method in [5]. All other algorithms have more limited performance (they generally cannot manage
more than a few thousand vertices).

2 Preliminaries on random walks 

The graph $G$ is associated to its adjacency matrix $A$: $A_{ij} = 1$ if vertices $i$ and $j$ are connected and $A_{ij} = 0$
otherwise. The degree $d(i) = \sum_j A_{ij}$ of the vertex $i$ is the number of its neighbors (including itself). To
simplify the notations, we only consider unweighted graphs in this paper. It is however trivial to extend
our results to weighted graphs ($A_{ij} \in \mathbb{R}^+$ instead of $A_{ij} \in \{0, 1\}$), which is an advantage of this approach.

Let us consider a discrete random walk process (or diffusion process) on the graph G (see [2U1 |3] for a 
complete presentation of the topic). At each time step a walker is on a vertex and moves to a vertex chosen 
randomly and uniformly among its neighbors. The sequence of visited vertices is a Markov chain, the states 
of which are the vertices of the graph. At each step, the transition probability from vertex $i$ to vertex $j$ is
$P_{ij} = \frac{A_{ij}}{d(i)}$. This defines the transition matrix $P$ of the random walk. We can also write $P = D^{-1}A$ where
$D$ is the diagonal matrix of the degrees ($\forall i$, $D_{ii} = d(i)$ and $D_{ij} = 0$ for $i \neq j$).
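
As a minimal illustration of the definition $P = D^{-1}A$ (a sketch only, not the authors' implementation), the following Python builds the transition matrix of a small graph with exact fractions. The function name `transition_matrix` and the edge-list input format are assumptions made for this example.

```python
from fractions import Fraction

def transition_matrix(n, edges):
    """Return P = D^{-1} A as a list of rows, adding a loop on every vertex."""
    # Closed neighbourhoods: every vertex is linked to itself, as the text imposes.
    neighbors = [{i} for i in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    P = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        d_i = len(neighbors[i])            # degree d(i), loop included
        for j in neighbors[i]:
            P[i][j] = Fraction(1, d_i)     # P_ij = A_ij / d(i)
    return P

# Path 0 - 1 - 2: vertex 1 has degree 3 (two neighbours plus its own loop).
P = transition_matrix(3, [(0, 1), (1, 2)])
```

Each row of `P` sums to 1, which is the stochasticity property used later in the proof of Lemma 1.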

The process is driven by the powers of the matrix $P$: the probability of going from $i$ to $j$ through a
random walk of length $t$ is $(P^t)_{ij}$. In the following, we will denote this probability by $P_{ij}^t$. It satisfies two
general properties of the random walk process (see proofs in Appendix A) which we will use in the sequel:

Property 1 When the length $t$ of a random walk starting at vertex $i$ tends towards infinity, the probability
of being on a vertex $j$ only depends on the degree of vertex $j$ (and not on the starting vertex $i$):

$$\forall i, \quad \lim_{t \to +\infty} P_{ij}^t = \frac{d(j)}{\sum_k d(k)}$$

Property 2 The probabilities of going from $i$ to $j$ and from $j$ to $i$ through a random walk of a fixed length
$t$ have a ratio that only depends on the degrees $d(i)$ and $d(j)$:

$$\forall i, \forall j, \quad d(i) P_{ij}^t = d(j) P_{ji}^t$$
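
Both properties can be checked numerically on a toy graph. The sketch below is only an illustration under stated assumptions: the graph, the helper names `transition_matrix` and `mat_mul`, and the choice of $2^{11}$ steps as a stand-in for "infinity" are all hypothetical. Property 2 is verified exactly with fractions; the limit of Property 1 (the standard stationary distribution $d(j) / \sum_k d(k)$) is approached in floating point.

```python
from fractions import Fraction

def transition_matrix(n, edges):
    """Return (P, degree) with P = D^{-1} A and a loop added on every vertex."""
    neighbors = [{i} for i in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    degree = [len(s) for s in neighbors]
    P = [[Fraction(1, degree[i]) if j in neighbors[i] else Fraction(0)
          for j in range(n)] for i in range(n)]
    return P, degree

def mat_mul(X, Y):
    """Plain dense matrix product, sufficient for a toy example."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# A triangle 0-1-2 with a tail 2-3-4 (hypothetical toy graph).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
n = 5
P, d = transition_matrix(n, edges)

# Property 2, checked exactly for walks of length t = 3.
P3 = mat_mul(mat_mul(P, P), P)
property_2_holds = all(d[i] * P3[i][j] == d[j] * P3[j][i]
                       for i in range(n) for j in range(n))

# Property 1, checked numerically: 11 squarings give P^(2^11) in floating point.
long_walk = [[float(x) for x in row] for row in P]
for _ in range(11):
    long_walk = mat_mul(long_walk, long_walk)
total_degree = sum(d)
limit_gap = max(abs(long_walk[i][j] - d[j] / total_degree)
                for i in range(n) for j in range(n))
```

Here `limit_gap` measures how far every row of the long-walk matrix is from the degree-proportional distribution, whatever the starting vertex.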



3 Comparing vertices using short random walks 

In order to group the vertices into communities, we will now introduce a distance r between the vertices 
that reflects the community structure of the graph. This distance must be large if the two vertices are in 
different communities, and on the contrary if they are in the same community it must be small. It will be 
computed from the information given by random walks in the graph. 

Let us consider random walks on $G$ of a given length $t$. We will use the information given by all the
probabilities $P_{ij}^t$ to go from $i$ to $j$ in $t$ steps. The length $t$ of the random walks must be sufficiently long
to gather enough information about the topology of the graph. However $t$ must not be too long, to avoid
the effect predicted by Property 1: the probabilities would only depend on the degree of the vertices. Each
probability $P_{ij}^t$ gives some information about the two vertices $i$ and $j$, but Property 2 says that $P_{ij}^t$ and $P_{ji}^t$
encode exactly the same information. Finally, the information about vertex $i$ encoded in $P^t$ resides in the
$n$ probabilities $(P_{ik}^t)_{1 \le k \le n}$, which is nothing but the $i$-th row of the matrix $P^t$, denoted by $P_{i\bullet}^t$. To compare
two vertices $i$ and $j$ using these data, we must notice that:

• If two vertices $i$ and $j$ are in the same community, the probability $P_{ij}^t$ will surely be high. But the
fact that $P_{ij}^t$ is high does not necessarily imply that $i$ and $j$ are in the same community.

• The probability $P_{ij}^t$ is influenced by the degree $d(j)$ because the walker has higher probability to go
to high degree vertices.

Two vertices of a same community tend to "see" all the other vertices in the same way. Thus if $i$ and
$j$ are in the same community, we will probably have $\forall k$, $P_{ik}^t \simeq P_{jk}^t$.



We can now give the definition of our distance between vertices, which takes into account all previous 
remarks: 

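Definition 1 Let $i$ and $j$ be two vertices in the graph:

$$r_{ij} = \sqrt{\sum_{k=1}^{n} \frac{(P_{ik}^t - P_{jk}^t)^2}{d(k)}} = \left\| D^{-\frac{1}{2}} P_{i\bullet}^t - D^{-\frac{1}{2}} P_{j\bullet}^t \right\| \qquad (1)$$

where $\|.\|$ is the Euclidean norm of $\mathbb{R}^n$.

One can notice that this distance can also be seen as the $L^2$ distance between the two probability
distributions $P_{i\bullet}^t$ and $P_{j\bullet}^t$. Notice also that the distance depends on $t$ and may be denoted $r_{ij}(t)$. We will
however consider it as implicit to simplify the notations.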

Theorem 1 The distance $r$ is related to the spectral properties of the matrix $P$ by:

$$r_{ij}^2 = \sum_{\alpha=2}^{n} \lambda_\alpha^{2t} \left( v_\alpha(i) - v_\alpha(j) \right)^2$$

where $(\lambda_\alpha)_{1 \le \alpha \le n}$ and $(v_\alpha)_{1 \le \alpha \le n}$ are respectively the eigenvalues and right eigenvectors of the matrix $P$.

In order to prove this theorem, we need the following technical lemma:

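Lemma 1 The eigenvalues of the matrix $P$ are real and satisfy:

$$1 = \lambda_1 > \lambda_2 \ge \ldots \ge \lambda_n > -1$$

Moreover, there exists an orthonormal family of vectors $(s_\alpha)_{1 \le \alpha \le n}$ such that each vector $v_\alpha = D^{-\frac{1}{2}} s_\alpha$ and
$u_\alpha = D^{\frac{1}{2}} s_\alpha$ are respectively a right and a left eigenvector associated to the eigenvalue $\lambda_\alpha$:

$$\forall \alpha, \quad P v_\alpha = \lambda_\alpha v_\alpha \quad \text{and} \quad P^T u_\alpha = \lambda_\alpha u_\alpha$$

$$\forall \alpha, \forall \beta, \quad v_\alpha^T u_\beta = \delta_{\alpha\beta}$$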

Proof : The matrix $P$ has the same eigenvalues as its similar matrix $S = D^{\frac{1}{2}} P D^{-\frac{1}{2}} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$. The
matrix $S$ is real and symmetric, so its eigenvalues $\lambda_\alpha$ are real. $P$ is a stochastic matrix ($\sum_{j=1}^{n} P_{ij} = 1$), so
its largest eigenvalue is $\lambda_1 = 1$. The graph $G$ is connected and primitive (the gcd of the cycle lengths of $G$
is 1, due to the loops on each vertex), therefore we can apply the Perron-Frobenius theorem which implies
that $P$ has a unique dominant eigenvalue. Therefore we have: $|\lambda_\alpha| < 1$ for $2 \le \alpha \le n$.

The symmetry of $S$ implies that there also exists an orthonormal family $s_\alpha$ of eigenvectors of $S$ satisfying
$\forall \alpha, \forall \beta$, $s_\alpha^T s_\beta = \delta_{\alpha\beta}$ (where $\delta_{\alpha\beta} = 1$ if $\alpha = \beta$ and $0$ otherwise). We then directly obtain that the vectors
$v_\alpha = D^{-\frac{1}{2}} s_\alpha$ and $u_\alpha = D^{\frac{1}{2}} s_\alpha$ are respectively a right and a left eigenvector of $P$ satisfying $u_\alpha^T v_\beta = \delta_{\alpha\beta}$. □

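We can now prove Theorem 1.

Proof : Lemma 1 makes it possible to write a spectral decomposition of the matrix $P$:

$$P = \sum_{\alpha=1}^{n} \lambda_\alpha v_\alpha u_\alpha^T, \quad P^t = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha u_\alpha^T, \quad \text{and so} \quad P_{ij}^t = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) u_\alpha(j)$$

Now we obtain the expression of the probability vector $P_{i\bullet}^t$:

$$P_{i\bullet}^t = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) u_\alpha = D^{\frac{1}{2}} \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) s_\alpha$$

We put this formula into the second definition of $r_{ij}$ given in Equation (1). Then we use the Pythagorean
theorem with the orthonormal family of vectors $(s_\alpha)_{1 \le \alpha \le n}$, and we remember that the vector $v_1$ is constant
to remove the case $\alpha = 1$ in the sum. Finally we have:

$$r_{ij}^2 = \left\| \sum_{\alpha=1}^{n} \lambda_\alpha^t \left( v_\alpha(i) - v_\alpha(j) \right) s_\alpha \right\|^2 = \sum_{\alpha=2}^{n} \lambda_\alpha^{2t} \left( v_\alpha(i) - v_\alpha(j) \right)^2$$

□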



This theorem relates random walks on graphs to the many current works that study spectral properties of 
graphs. For example, [2H| notices that the modular structure of a graph is expressed in the eigenvectors of P 
(other than $v_1$) that correspond to the largest positive eigenvalues. If two vertices $i$ and $j$ belong to a same
community then the coordinates $v_\alpha(i)$ and $v_\alpha(j)$ are similar in all these eigenvectors. Moreover, [27, 14] show
in a more general case that when an eigenvalue $\lambda_\alpha$ tends to 1, the coordinates of the associated eigenvector
$v_\alpha$ are constant in the subsets of vertices that correspond to communities. A distance similar to ours (but
that cannot be computed directly with random walks) is also introduced: $d_t^2(i,j) = \sum_{\alpha=2}^{n} \frac{(v_\alpha(i) - v_\alpha(j))^2}{1 - |\lambda_\alpha|^t}$.
Finally, |B] uses the same spectral approach applied to the Laplacian matrix of the graph L = D — A. 

All these studies show that the spectral approach takes an important part in the search for community 
structure in graphs. However all these approaches have the same drawback: the eigenvectors need to be 
explicitly computed (in time $O(n^3)$ for a sparse matrix). This computation rapidly becomes intractable
in practice when the size of the graph exceeds some thousands of vertices. Our approach is based on the
same foundation but has the advantage of avoiding the expensive computation of the eigenvectors: it only
needs to compute the probabilities $P_{ij}^t$, which can be done efficiently as shown in the following theorem.

Theorem 2 All the probabilities $P_{ij}^t$ can be computed in time $O(tnm)$ and space $O(n^2)$. Once these
probabilities are computed, each distance $r_{ij}$ can be computed in time $O(n)$. For given $i$ and $j$, one can also
compute directly $r_{ij}$ in time $O(tm)$ and space $O(n)$.



Proof : To compute the vector $P_{i\bullet}^t$, we multiply $t$ times the vector $P_{i\bullet}^0$ ($\forall k$, $P_{i\bullet}^0(k) = \delta_{ik}$) by the matrix
$P$. This direct method is advantageous in our case because the matrix $P$ is generally sparse (for real-world
complex networks) therefore each product is processed in time $O(m)$. The initialization of $P_{i\bullet}^0$ is done in
$O(n)$ and thus each of the $n$ vectors $P_{i\bullet}^t$ is computed in time $O(n + tm) = O(tm)$. Once we have the two
vectors $P_{i\bullet}^t$ and $P_{j\bullet}^t$, we can compute $r_{ij}$ in $O(n)$ using Equation (1). We can compute and keep in memory
all the probability vectors in time $O(tnm)$ or compute directly $r_{ij}$ by evaluating the two vectors $P_{i\bullet}^t$ and
$P_{j\bullet}^t$ in time $O(tm)$. □
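
The direct method of this proof can be sketched in a few lines of Python. This is an illustrative sketch under assumptions, not the authors' code: the adjacency-list representation and the names `closed_neighborhoods`, `walk_distribution` and `distance` are hypothetical. Each step of `walk_distribution` touches every adjacency entry once, which is the $O(m)$ sparse product mentioned above.

```python
import math

def closed_neighborhoods(n, edges):
    """Adjacency lists with a loop on every vertex, as imposed in the introduction."""
    neighbors = [{i} for i in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return [sorted(s) for s in neighbors]

def walk_distribution(neighbors, start, t):
    """Row P^t_{start .}, obtained by t sparse vector-matrix products."""
    n = len(neighbors)
    prob = [0.0] * n
    prob[start] = 1.0                          # P^0_{start .}
    for _ in range(t):
        nxt = [0.0] * n
        for k, p_k in enumerate(prob):
            if p_k > 0.0:
                share = p_k / len(neighbors[k])    # P_kj = 1 / d(k)
                for j in neighbors[k]:
                    nxt[j] += share
        prob = nxt
    return prob

def distance(neighbors, i, j, t):
    """r_ij from the sum form of the distance, using two walk vectors of size n."""
    p_i = walk_distribution(neighbors, i, t)
    p_j = walk_distribution(neighbors, j, t)
    return math.sqrt(sum((a - b) ** 2 / len(neighbors[k])
                         for k, (a, b) in enumerate(zip(p_i, p_j))))

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3 (toy example).
graph = closed_neighborhoods(
    6, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
r_same = distance(graph, 0, 1, t=3)    # two vertices of the same triangle
r_far = distance(graph, 0, 5, t=3)     # vertices of different triangles
```

On this toy graph the distance inside a triangle is smaller than the distance across the bridge, which is the behaviour the distance is designed to have.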



Now we generalize our distance between vertices to a distance between communities in a straightforward 
way. Let us consider random walks that start from a community: the starting vertex is chosen randomly 
and uniformly among the vertices of the community. We define the probability $P_{Cj}^t$ to go from community
$C$ to vertex $j$ in $t$ steps:

$$P_{Cj}^t = \frac{1}{|C|} \sum_{i \in C} P_{ij}^t$$

This defines a probability vector $P_{C\bullet}^t$ that allows us to generalize our distance:

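Definition 2 Let $C_1, C_2 \subset V$ be two communities. We define the distance $r_{C_1C_2}$ between these two communities by:

$$r_{C_1C_2} = \left\| D^{-\frac{1}{2}} P_{C_1\bullet}^t - D^{-\frac{1}{2}} P_{C_2\bullet}^t \right\| = \sqrt{\sum_{k=1}^{n} \frac{(P_{C_1k}^t - P_{C_2k}^t)^2}{d(k)}}$$

This definition is consistent with the previous one: $r_{ij} = r_{\{i\}\{j\}}$ and we can also define the distance between
a vertex $i$ and a community $C$: $r_{iC} = r_{\{i\}C}$. Given the probability vectors $P_{C_1\bullet}^t$ and $P_{C_2\bullet}^t$, the distance
$r_{C_1C_2}$ is also computed in time $O(n)$.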



4 The algorithm 

In the previous section, we have proposed a distance between vertices (and between sets of vertices) which 
captures structural similarities between them. The problem of finding communities is now a clustering 
problem. We will use here an efficient hierarchical clustering algorithm that allows us to find community 
structures at different scales. We present an agglomerative approach based on Ward's method [3U] that is 
well adapted to our distance and gives very good results while reducing the number of distance computations 
in order to be able to process large graphs. 

We start from a partition $\mathcal{P}_1 = \{\{v\}, v \in V\}$ of the graph into $n$ communities reduced to a single vertex.
We first compute the distances between all adjacent vertices. Then this partition evolves by repeating the
following operations. At each step $k$:

• Choose two communities $C_1$ and $C_2$ in $\mathcal{P}_k$ on a criterion based on the distance between the communities
that we detail later.

• Merge these two communities into a new community $C_3 = C_1 \cup C_2$ and create the new partition:
$\mathcal{P}_{k+1} = (\mathcal{P}_k \setminus \{C_1, C_2\}) \cup \{C_3\}$.

• Update the distances between communities (we will see later that we actually only do this for adjacent 
communities) . 

After $n - 1$ steps, the algorithm finishes and we obtain $\mathcal{P}_n = \{V\}$. Each step defines a partition of
the graph into communities, which gives a hierarchical structure of communities called dendrogram (see
Figure 1(b)). This structure is a tree in which the leaves correspond to the vertices and each internal node
is associated to a merging of communities in the algorithm: it corresponds to a community composed of
the union of the communities corresponding to its children.

The key points in this algorithm are the way we choose the communities to merge, and the fact that 
the distances can be updated efficiently. We will also need to evaluate the quality of a partition in order 
to choose one of the $\mathcal{P}_k$ as the result of our algorithm. We will detail these points below, and explain how
they can be managed to give an efficient algorithm. 



Choosing the communities to merge. This choice plays a central role for the quality of the 
community structure created. In order to reduce the complexity, we will only merge adjacent communities 
(having at least an edge between them). This reasonable heuristic (already used in [22] and [H]) limits to 
$m$ the number of possible mergings at each stage. Moreover it ensures that each community is connected.

We choose the two communities to merge according to Ward's method. At each step $k$, we merge the two
communities that minimize the mean $\sigma_k$ of the squared distances between each vertex and its community.

$$\sigma_k = \frac{1}{n} \sum_{C \in \mathcal{P}_k} \sum_{i \in C} r_{iC}^2$$

This approach is a greedy algorithm that tries to solve the problem of minimizing $\sigma_k$ for each $k$. But this
problem is known to be NP-hard: even for a given $k$, minimizing $\sigma_k$ is the NP-hard "K-Median clustering
problem" ^| |S] for K = (n — k) clusters. The existing approximation algorithms |101 [S] are exponential 
with the number of clusters to find and unsuitable for our purpose. So for each pair of adjacent communities
$\{C_1, C_2\}$, we compute the variation $\Delta\sigma(C_1, C_2)$ of $\sigma$ that would result from merging $C_1$ and $C_2$ into a new community
$C_3 = C_1 \cup C_2$. This quantity only depends on the vertices of $C_1$ and $C_2$, and not on the other communities
or on the step $k$ of the algorithm:

$$\Delta\sigma(C_1, C_2) = \frac{1}{n} \left( \sum_{i \in C_3} r_{iC_3}^2 - \sum_{i \in C_1} r_{iC_1}^2 - \sum_{i \in C_2} r_{iC_2}^2 \right) \qquad (2)$$

Finally, we merge the two communities that give the lowest value of $\Delta\sigma$.



Computing $\Delta\sigma$ and updating the distances. The important point here is to notice that these
quantities can be efficiently computed thanks to the fact that our distance is a Euclidean distance, which 
makes it possible to obtain the two following classical results [TJ\ (proofs in Appendix A): 

Theorem 3 The increase of $\sigma$ after the merging of two communities $C_1$ and $C_2$ is directly related to the
distance $r_{C_1C_2}$ by:

$$\Delta\sigma(C_1, C_2) = \frac{1}{n} \frac{|C_1||C_2|}{|C_1| + |C_2|} r_{C_1C_2}^2$$

This theorem shows that we only need to update the distances between communities to get the values
of $\Delta\sigma$: if we know the two vectors $P_{C_1\bullet}^t$ and $P_{C_2\bullet}^t$, the computation of $\Delta\sigma(C_1, C_2)$ is possible in $O(n)$.
Moreover, the next theorem shows that if we already know the three values $\Delta\sigma(C_1, C_2)$, $\Delta\sigma(C_1, C)$ and
$\Delta\sigma(C_2, C)$, then we can compute $\Delta\sigma(C_1 \cup C_2, C)$ in constant time.

Theorem 4 (Lance-Williams-Jambu formula) If $C_1$ and $C_2$ are merged into $C_3 = C_1 \cup C_2$ then for
any other community $C$:

$$\Delta\sigma(C_3, C) = \frac{(|C_1| + |C|)\Delta\sigma(C_1, C) + (|C_2| + |C|)\Delta\sigma(C_2, C) - |C|\Delta\sigma(C_1, C_2)}{|C_1| + |C_2| + |C|} \qquad (3)$$
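
The two Ward identities above (the closed form of Theorem 3 and the constant-time update of Theorem 4) can be checked numerically against the defining difference of within-community sums. The sketch below is an illustration under assumptions: the toy graph, the walk length $t = 3$ and all function names are hypothetical, and communities are represented as plain lists of vertex indices.

```python
import math

def scaled_walk_rows(n, edges, t):
    """Rows D^{-1/2} P^t_{i.} for all i; a loop is added on every vertex."""
    neighbors = [{i} for i in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    degree = [len(s) for s in neighbors]
    rows = []
    for start in range(n):
        prob = [0.0] * n
        prob[start] = 1.0
        for _ in range(t):
            nxt = [0.0] * n
            for k, p_k in enumerate(prob):
                for j in neighbors[k]:
                    nxt[j] += p_k / degree[k]
            prob = nxt
        rows.append([p / math.sqrt(degree[k]) for k, p in enumerate(prob)])
    return rows

def centroid(rows, community):
    """D^{-1/2} P^t_{C.}: the mean of the member rows."""
    return [sum(rows[i][k] for i in community) / len(community)
            for k in range(len(rows[0]))]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def within_sum(rows, community):
    """Sum over i in C of r_iC^2."""
    c = centroid(rows, community)
    return sum(sq_dist(rows[i], c) for i in community)

def delta_sigma_definition(rows, c1, c2):
    """Variation of sigma computed from the three within-community sums."""
    return (within_sum(rows, c1 + c2) - within_sum(rows, c1)
            - within_sum(rows, c2)) / len(rows)

def delta_sigma_theorem_3(rows, c1, c2):
    r2 = sq_dist(centroid(rows, c1), centroid(rows, c2))
    return len(c1) * len(c2) / (len(c1) + len(c2)) * r2 / len(rows)

def delta_sigma_theorem_4(ds_1c, ds_2c, ds_12, s1, s2, s):
    return ((s1 + s) * ds_1c + (s2 + s) * ds_2c - s * ds_12) / (s1 + s2 + s)

rows = scaled_walk_rows(
    6, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)], t=3)
c1, c2, c = [0, 1], [2], [3, 4, 5]
from_definition = delta_sigma_definition(rows, c1, c2)
from_theorem_3 = delta_sigma_theorem_3(rows, c1, c2)
from_theorem_4 = delta_sigma_theorem_4(
    delta_sigma_theorem_3(rows, c1, c), delta_sigma_theorem_3(rows, c2, c),
    from_theorem_3, len(c1), len(c2), len(c))
direct = delta_sigma_theorem_3(rows, c1 + c2, c)
```

The update of Theorem 4 only uses three already known values and three sizes, which is why it runs in constant time, whereas the direct evaluation needs the two probability vectors.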

Since we only merge adjacent communities, we only need to update the values of $\Delta\sigma$ between adjacent
communities (there are at most $m$ values). These values are stored in a balanced tree in which we can add,
remove or get the minimum in $O(\log m)$. Each computation of a value of $\Delta\sigma$ can be done in time $O(n)$
with Theorem 3 or in constant time when Theorem 4 can be applied.

Evaluating the quality of a partition. The algorithm induces a sequence $(\mathcal{P}_k)_{1 \le k \le n}$ of partitions
into communities. We now want to know which partitions in this sequence are good representations of
communities. The most common way is to use a statistical parameter such as the modularity $Q$.
This quantity (between $-1$ and $1$) is well suited to find the best partition but not to find
several ones (corresponding to other scales in the hierarchical structure, see Appendix B). Here we provide
another criterion that helps in finding different scales of communities. When we merge two very different
communities (with respect to the distance $r$), the value $\Delta\sigma_k = \sigma_{k+1} - \sigma_k$ at this step is large. Conversely,
if $\Delta\sigma_k$ is large then the communities at step $k - 1$ are surely relevant. To detect this, we introduce the
increase ratio $\eta_k$:

$$\eta_k = \frac{\Delta\sigma_k}{\Delta\sigma_{k-1}} = \frac{\sigma_{k+1} - \sigma_k}{\sigma_k - \sigma_{k-1}}$$

We then assume that the best partitions $\mathcal{P}_k$ are those associated with the largest values of $\eta_k$. Depending
on the context in which our algorithm is used, one may take only the best partition (the one for which $\eta_k$
is maximal) or choose among the best ones using another criterion (like the size of the communities, for
instance). This is an important advantage of our method, which gives different scales in the community
structure, as illustrated in Appendix B.
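
The increase ratio is simple to compute once the sequence of $\sigma_k$ values is known. The following sketch is illustrative only: the function name `increase_ratios`, the list-based input and the numeric values are assumptions, and skipping steps whose previous increase is zero is a choice made here, not something stated in the text.

```python
def increase_ratios(sigma):
    """Return {k: eta_k} from sigma = [sigma_1, ..., sigma_n] (k is 1-based)."""
    eta = {}
    for k in range(2, len(sigma)):                 # k = 2 .. n - 1
        previous = sigma[k - 1] - sigma[k - 2]     # sigma_k - sigma_{k-1}
        current = sigma[k] - sigma[k - 1]          # sigma_{k+1} - sigma_k
        if previous > 0:                           # assumption: skip undefined ratios
            eta[k] = current / previous
    return eta

# Hypothetical values of sigma_1 .. sigma_5 for a graph with n = 5 vertices.
sigma = [0.0, 0.1, 0.2, 0.3, 1.5]
eta = increase_ratios(sigma)
best_k = max(eta, key=eta.get)
```

With these made-up values the last merge is far more costly than the previous ones, so `best_k` points at the partition just before that merge.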




Figure 1: (a) An example of community structure found by our algorithm using random walks of length
$t = 3$. (b) The stages of the algorithm encoded as a tree called dendrogram. The maxima of $\eta_k$ and $Q$,
plotted in (c), show that the best partition consists of two communities.

Complexity. First, the initialization of the probability vectors is done in $O(mnt)$. Then, at each step
$k$ of the algorithm, we keep in memory the vectors $P_{C\bullet}^t$ corresponding to the current communities (the
ones in the current partition). But for the communities that are not in $\mathcal{P}_k$ (because they have been merged
with another community before) we only keep the information saying in which community they have been
merged. We keep enough information to construct the dendrogram and have access to the composition of
any community with a little more computation.

When we merge two communities $C_1$ and $C_2$ we perform the following operations:

• Compute $P_{(C_1 \cup C_2)\bullet}^t = \frac{|C_1| P_{C_1\bullet}^t + |C_2| P_{C_2\bullet}^t}{|C_1| + |C_2|}$ and remove $P_{C_1\bullet}^t$ and $P_{C_2\bullet}^t$.

• Update the values of $\Delta\sigma$ concerning $C_1$ and $C_2$ using Theorem 4 if possible, or otherwise using
Theorem 3.

The first operation can be done in $O(n)$, and therefore does not play a significant role in the overall
complexity of the algorithm. The dominating factor in the complexity of the algorithm is the number of
distances $r$ computed (each one in $O(n)$). We prove an upper bound of this number that depends on the
height of the dendrogram. We denote by $h(C)$ the height of a community $C$ and by $H$ the height of the
whole tree ($H = h(V)$).

Theorem 5 An upper bound of the number of distances computed by our algorithm is $2mH$. Therefore its
global time complexity is $O(mn(H + t))$.

Proof : Let $M$ be the number of computations of $\Delta\sigma$. $M$ is equal to $m$ (initialization of the first $\Delta\sigma$) plus
the sum over all steps $k$ of the number of neighbors of the new community created at step $k$ (when we
merge two communities, we need to update one value of $\Delta\sigma$ per neighbor). For each height $1 \le h \le H$,
the communities with the same height $h$ are pairwise disjoint, and the sum of their number of neighbor
communities is less than $2m$ (each edge can at most define two neighborhood relations). The sum over all
heights finally gives $M \le 2Hm$. Each of these $M$ computations needs at most one computation of $r$ in
time $O(n)$ (Theorem 3). Therefore, with the initialization, the global complexity is $O(mn(H + t))$. □

In practice, a small $t$ must be chosen ($t = O(\log n)$): since the random walks converge to the limit
distribution at exponential speed, longer walks would only reflect the degrees of the vertices (Property 1).
The global complexity is thus $O(mnH)$. The
worst case is $H = n - 1$, which occurs when the vertices are merged one by one to a large community.
This happens in the "star" graph, where a central vertex is linked to the $n - 1$ others. However Ward's
algorithm is known to produce small communities of similar sizes. This tends to get closer to the favorable
case in which the community structure is a balanced tree and its height is $H = O(\log n)$.

5 Experimental evaluation of the algorithm 

Evaluating a community detection algorithm is difficult because one needs some test graphs whose community
structure is already known. A classical approach, which we will follow here, is to use randomly
generated graphs with communities defined as follows: one constructs a graph with $n$ vertices and $c > 1$
disjoint communities of $\frac{n}{c}$ vertices. An internal and an external density of edges $p_{in}$ and $p_{out}$ are given.
Each possible edge inside a community is drawn with probability $p_{in}$ and each possible edge between two
communities is drawn with probability $p_{out}$. These two probabilities define an expected average in-degree
$z_{in} = p_{in}(\frac{n}{c} - 1)$ and an expected average out-degree $z_{out} = p_{out} \frac{n(c-1)}{c}$.
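
A generator for such test graphs can be sketched as follows. This is a hedged illustration, not the code used for the experiments: the function names are hypothetical, the sketch assumes that $c$ divides $n$, and the out-degree relation is derived from the fact that each vertex has $n - \frac{n}{c}$ possible neighbors outside its community.

```python
import random

def densities_from_degrees(n, c, z_in, z_out):
    """Invert z_in = p_in (n/c - 1) and z_out = p_out n (c - 1) / c."""
    p_in = z_in / (n / c - 1)
    p_out = z_out / (n * (c - 1) / c)
    return p_in, p_out

def planted_partition(n, c, p_in, p_out, seed=None):
    """Return (community, edges); assumes that c divides n."""
    rng = random.Random(seed)
    size = n // c
    community = [v // size for v in range(n)]      # known community of each vertex
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:                   # draw each possible edge once
                edges.append((u, v))
    return community, edges

# Setting with n = 128, c = 4 and z_in + z_out = 16.
p_in, p_out = densities_from_degrees(128, 4, z_in=12, z_out=4)
community, edges = planted_partition(128, 4, p_in, p_out, seed=0)
```

The quadratic loop over vertex pairs is fine for graphs of this size; a larger experiment would need a sparser sampling scheme.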

In order to evaluate the performance of our algorithm we will evaluate the ratio of vertices correctly 
identified by the algorithm. This ratio has been used (without formal definition) in [l'6\ 1151 1231 1221 
Here we define it according to the following identification procedure: we want to identify the $c$ known
communities $(C_i)_{1 \le i \le c}$ to the communities $(C'_j)_j$ found by the algorithm. We identify each $C_i$ to the
community $C'_{\gamma(i)}$ such that $|C_i \cap C'_{\gamma(i)}|$ is maximal. If there are $l > 1$ communities $C_{i_1}, \ldots, C_{i_l}$ identified to
the same community $C'_j$ ($\gamma(i_1) = \cdots = \gamma(i_l) = j$), then we only keep the identification of the community
$C_{i_k}$ which maximizes $|C_{i_k} \cap C'_j|$. The other communities are no longer identified to any community. A
vertex is then correctly identified if the community found by the algorithm to which it belongs is identified to its
actual community.
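
One possible reading of this identification procedure is sketched below. The function name, the representation of communities as Python sets, and the tie-breaking rule (the first candidate wins) are assumptions of this example; the text does not specify how ties are resolved.

```python
def correctly_identified_ratio(known, found):
    """known, found: lists of vertex sets; returns the ratio in [0, 1]."""
    kept = {}                                      # found index -> known index
    for i, community in enumerate(known):
        # gamma(i): found community with the largest overlap (first one on ties).
        j = max(range(len(found)), key=lambda idx: len(community & found[idx]))
        best = kept.get(j)
        if best is None or len(community & found[j]) > len(known[best] & found[j]):
            kept[j] = i                            # keep only the best identification
    # A vertex counts when its found community is identified to its known one.
    correct = sum(len(known[i] & found[j]) for j, i in kept.items())
    return correct / sum(len(community) for community in known)

known = [{0, 1, 2, 3}, {4, 5, 6, 7}]
ratio_close = correctly_identified_ratio(known, [{0, 1, 2}, {3, 4, 5, 6, 7}])
ratio_merged = correctly_identified_ratio(known, [{0, 1, 2, 3, 4, 5, 6, 7}])
```

In the first call only vertex 3 is misplaced; in the second, the single found community can be identified to one known community only, so half of the vertices count as correct.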

In order to compare our algorithm to the other known algorithms, we first study the influence of the 
densities pi n and p out on the same graphs as the ones used in ^3 1151 1231 1221 E] . They considered graphs of 
$n = 128$ vertices and $c = 4$ communities but different densities $p_{in}$ and $p_{out}$. The results are plotted in Figure 2. They
indicate that our algorithm has perfect results when the graph has a clear community structure (i.e. when
$p_{in}$ is high and $p_{out}$ is low). When $p_{out}$ is high and $p_{in}$ is low, the graph does not really have a community
structure, which explains why our algorithm does not find it. In intermediate cases, our algorithm
performs better than previously proposed algorithms, which generally have only been tested in the
case $z_{in} + z_{out} = 16$.

Figure 2: Performances of our algorithm on graphs with $n = 128$ vertices and $c = 4$ communities using
random walks of length $t = 3$. Left: ratio of vertices correctly identified as a function of the edge densities
$p_{in}$ and $p_{out}$ (black stands for 0% and white for 100%). Right: detail of this plot for a constant average
degree $z_{in} + z_{out} = 16$.

In order to deepen the empirical study of the performance of our algorithm, we tested it on various
situations. In all of them, the performance was very good, the only cases where our algorithm fails being
the extreme cases in which no clear community structure exists. To illustrate this we detail two of these
experiments below (and two others in Appendix B). Let us consider graphs of different sizes from $n = 100$
to $n = 10{,}000$ with $c = 10$ communities. The external density of edges is chosen in order to have a mean
out-degree $z_{out} = 8$. The internal density $p_{in}(C)$ of each community $C$ is randomly and uniformly chosen
from an interval $[p_{min}, p_{max}]$ such that its mean internal degree satisfies $6 \le z_{in}(C) \le 10$. The results
for $t = 5$ (Figure 3, left) show that our algorithm performs well on large graphs even with some
heterogeneity in the communities.



nb vertices    ratio of vertices correctly identified    time
100            99%                                       0.05 s
300            93%                                       0.25 s
1,000          90%                                       2.6 s
3,000          73%                                       21 s
10,000         71%                                       11 min (a)

(a) This case has been slowed by the lack of memory on the
512 MB RAM machine on which the experiments were run.



Figure 3: Left: Performances of our algorithm on graphs with $c = 10$ communities, $z_{out} = 8$ and
$6 \le z_{in} \le 10$ for various sizes of graphs. Right: ratio of correctly identified vertices on the same kind of
graphs with $n = 5000$ vertices when the number of communities varies.



We also studied the influence of the number of communities. We consider the same kind of graphs 
as above with n = 5000 vertices and with different number c of communities. The results are plotted in 
Figure 3 (right). We chose to keep the same global density of graphs (the expected average degree is always
16) and increase the number of communities which implies a decrease in their size and their internal density
of edges. The experiment shows that, even if the overall number of expected external edges is equal to the
one of internal edges, our algorithm easily detects the communities with a sufficient internal density.

6 Improvements



Several improvements of our algorithm are possible and we investigated some. We do not present them in 
detail in this abstract, but rapidly outline two of the most interesting ones. 

One may first replace the computation of the probabilities $P_{i\bullet}^t$ by an approximation obtained by running
a given number $K$ of random walks from vertex $i$. The precision of the estimation is $O(\frac{1}{\sqrt{K}})$ and a good
evaluation is typically obtained with $K = 1000$. This approach is interesting for very large graphs and
allows us to estimate each vector $P_{i\bullet}^t$ in $O(K(t + \log K))$.
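
The sampling idea can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the names are hypothetical, the toy graph is given directly as adjacency lists that already contain the loops, and the counts are kept in a dictionary rather than in the structure that yields the $\log K$ term.

```python
import random
from collections import Counter

def exact_distribution(neighbors, start, t):
    """Reference value of the walk distribution, by t sparse products."""
    prob = [0.0] * len(neighbors)
    prob[start] = 1.0
    for _ in range(t):
        nxt = [0.0] * len(neighbors)
        for k, p_k in enumerate(prob):
            for j in neighbors[k]:
                nxt[j] += p_k / len(neighbors[k])
        prob = nxt
    return prob

def sampled_distribution(neighbors, start, t, K, rng):
    """Estimate the same distribution from K simulated walks of length t."""
    counts = Counter()
    for _ in range(K):
        v = start
        for _ in range(t):
            v = rng.choice(neighbors[v])       # uniform move among the neighbors
        counts[v] += 1
    return [counts[v] / K for v in range(len(neighbors))]

# Two triangles joined by the edge 2-3; every list includes the vertex itself.
neighbors = [[0, 1, 2], [0, 1, 2], [0, 1, 2, 3],
             [2, 3, 4, 5], [3, 4, 5], [3, 4, 5]]
exact = exact_distribution(neighbors, 0, 3)
approx = sampled_distribution(neighbors, 0, 3, K=1000, rng=random.Random(1))
max_error = max(abs(a - b) for a, b in zip(exact, approx))
```

The cost of the estimate depends on $K$ and $t$ but not on the size of the graph, which is what makes the approach attractive for very large inputs.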

Another improvement concerns the discrete nature of random walks. As already noticed, the best length
$t$ of the random walks is generally small and its choice is difficult. Discrete time is restrictive and may
not give enough freedom. One may then replace the discrete Markov chain by its continuous version, and
obtain new probabilities $P^t_{ij}$ whose expression is $\forall t \in \mathbb{R}^+$, $P^t_{ij} = (e^{t(P - Id)})_{ij}$. Each probability vector $P^t_{i.}$
can be efficiently computed in $O(rm)$ with an error $\epsilon < \frac{t^r}{r!}$ by truncating the exponential series at
a given range $r$. This improvement makes it possible to choose non-integer values for $t$.
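As a rough illustration of the truncated series, the sketch below uses the identity $e^{t(P - Id)} = e^{-t} \sum_k \frac{t^k}{k!} P^k$ and keeps the first $r$ terms. The function name `continuous_walk_vector` and the adjacency-list representation are assumptions made for this example; each term costs one pass over the edges, which is where the $O(rm)$ bound comes from.

```python
import math

def continuous_walk_vector(adj, i, t, r):
    """Approximate row i of exp(t (P - Id)) with the first r terms of the series.

    adj[u] is the list of neighbours of vertex u (vertices are 0..n-1, and
    every vertex is assumed to have at least one neighbour).
    """
    n = len(adj)
    term = [0.0] * n
    term[i] = 1.0                      # k = 0 term: e_i^T P^0
    total = term[:]
    for k in range(1, r):
        nxt = [0.0] * n
        for u in range(n):             # one vector-matrix product, O(n + m)
            if term[u]:
                share = term[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
        term = [x * t / k for x in nxt]  # now (t^k / k!) e_i^T P^k
        total = [a + b for a, b in zip(total, term)]
    scale = math.exp(-t)
    return [scale * x for x in total]

# Two vertices joined by one edge: the exact answer is known in closed form.
approx = continuous_walk_vector([[1], [0]], i=0, t=1.5, r=30)
exact = [(1 + math.exp(-3.0)) / 2, (1 - math.exp(-3.0)) / 2]
```

Here `t=1.5` is a non-integer walk length, which the discrete chain cannot express.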

7 Conclusion and further work 

We proposed a new distance between vertices that quantifies their structural similarities using random
walks. This distance has several advantages: it captures much information on the community structure, it
is well suited for approximation, and it can be used in an efficient hierarchical agglomerative algorithm that
detects communities in a network at different scales. We designed such an algorithm which works in time
$O(mn^2)$. In practice, real-world complex networks are sparse ($m = O(n)$) and the height of the dendrogram
is generally small ($H = O(\log n)$); in this case the algorithm runs in $O(n^2 \log n)$. This complexity may be
reduced with the improvements sketched in Section 6.

Most previous methods were unable to manage networks with more than approximately 10,000 vertices,
except the one in [3] which goes up to several hundreds of thousands. We ran our algorithm on networks
of up to 100,000 vertices, and experiments show that the obtained quality is better than the one obtained
by that method, and actually better than the one obtained by all previous ones. Several possible improvements have
been pointed out which will improve the performance of our algorithm. Moreover, our method is well
suited for detecting communities at various scales (see Appendix B). We therefore think that it may be
considered as a significant step in the area.

Choosing an appropriate length t of the random walks is however still a problem and we have work 
in progress in this direction. More experiments on real-world complex networks also still have to be 
performed (see Appendix B), as well as direct comparison with algorithms proposed and implemented by 
other authors 1 . Our approach may also be relevant for the computation of overlapping communities (which
often occur in real-world cases and are not considered by any algorithm until now), which we consider as a
promising direction for further work. Finally, we pointed out that the method is directly usable for weighted 
networks. For directed ones (like the important case of the Web graph), on the contrary, the proofs we 
provided are not valid anymore, and random walks behave significantly differently. Therefore, we also 
consider the directed case as an interesting direction. 
A Proofs of previously known results

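Proof of Property 1

$$P = \sum_{\alpha=1}^{n} \lambda_\alpha v_\alpha u_\alpha^T, \quad \text{and} \quad P^t = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha u_\alpha^T, \quad \text{and so} \quad P^t_{ij} = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) u_\alpha(j)$$

When $t$ tends towards infinity, all the terms $\alpha \geq 2$ vanish. It is easy to show that the first right eigenvector
$v_1$ is constant. By normalizing we have $\forall i, v_1(i) = \frac{1}{\sqrt{\sum_k d(k)}}$ and $\forall j, u_1(j) = \frac{d(j)}{\sqrt{\sum_k d(k)}}$. We obtain:

$$\lim_{t \to +\infty} P^t_{ij} = \lim_{t \to +\infty} \lambda_1^t v_1(i) u_1(j) = \frac{d(j)}{\sum_{k=1}^{n} d(k)}$$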

Proof of Property 2

This property can be written as the matrix equation $D P^t D^{-1} = (P^t)^T$ (where $M^T$ is the transpose
of the matrix $M$). By using $P = D^{-1}A$ and the symmetry of the matrices $D$ and $A$, we have:

$$D P^t D^{-1} = D (D^{-1}A)^t D^{-1} = (A D^{-1})^t = (A^T (D^{-1})^T)^t = ((D^{-1}A)^T)^t = (P^t)^T$$
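Both properties can be checked numerically on a small example. The sketch below is only an illustration: the helper names and the toy graph (a triangle with one pendant vertex, which is connected and not bipartite) are hypothetical, and powers of $P$ are computed by naive dense multiplication.

```python
def transition_matrix(adj):
    """Build P = D^{-1} A from adjacency lists (vertices are 0..n-1)."""
    n = len(adj)
    return [[(1.0 / len(adj[i]) if j in adj[i] else 0.0) for j in range(n)]
            for i in range(n)]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(P, t):
    """Return P^t by repeated multiplication, starting from the identity."""
    n = len(P)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result

adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
deg = [len(nb) for nb in adj]
n = len(adj)
P = transition_matrix(adj)

# Detailed balance form of the second property: d(i) P^t_ij = d(j) P^t_ji.
Pt = mat_power(P, 5)
property2_gap = max(abs(deg[i] * Pt[i][j] - deg[j] * Pt[j][i])
                    for i in range(n) for j in range(n))

# Long-walk limit: P^t_ij approaches d(j) / sum_k d(k) for every start i.
P_long = mat_power(P, 200)
property1_gap = max(abs(P_long[i][j] - deg[j] / sum(deg))
                    for i in range(n) for j in range(n))
```

Both gaps should be at the level of floating-point rounding for this graph.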

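Proof of Theorem 3

First notice that the distance $r$ can be considered as a metric in $\mathbb{R}^n$ (which contains the probability vectors
$P^t_{C.}$) associated with an inner product $\langle . | . \rangle$. We have: $r^2_{iC} = \langle P^t_{C.} - P^t_{i.} \,|\, P^t_{C.} - P^t_{i.} \rangle$. In order to clarify
the text we will use vector notation. For every vertex $i$ and community $C$, we define $\overrightarrow{iC} = P^t_{C.} - P^t_{i.}$, and
for any two communities $C_1$ and $C_2$, $\overrightarrow{C_1C_2} = P^t_{C_2.} - P^t_{C_1.}$. We can write:

$$\sum_{i \in C_1} r^2_{iC_3} = \sum_{i \in C_1} \langle \overrightarrow{iC_3} | \overrightarrow{iC_3} \rangle = \sum_{i \in C_1} \left( \langle \overrightarrow{iC_1} | \overrightarrow{iC_1} \rangle + 2 \langle \overrightarrow{iC_1} | \overrightarrow{C_1C_3} \rangle + \langle \overrightarrow{C_1C_3} | \overrightarrow{C_1C_3} \rangle \right)$$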

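We then notice that $P^t_{C_1.}$ is the centroid of the vectors $\{P^t_{i.} \mid i \in C_1\}$, therefore we have $\sum_{i \in C_1} \overrightarrow{iC_1} = \overrightarrow{0}$.

Moreover we also have $\overrightarrow{C_1C_3} = \frac{|C_2|}{|C_1| + |C_2|} \overrightarrow{C_1C_2}$ and we finally obtain:

$$\sum_{i \in C_1} r^2_{iC_3} = \sum_{i \in C_1} r^2_{iC_1} + \frac{|C_1||C_2|^2}{(|C_1| + |C_2|)^2} r^2_{C_1C_2}$$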

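This also holds if we replace $C_1$ by $C_2$ and $C_2$ by $C_1$. Therefore:

$$\sum_{i \in C_3} r^2_{iC_3} = \sum_{i \in C_1} r^2_{iC_3} + \sum_{i \in C_2} r^2_{iC_3} = \sum_{i \in C_1} r^2_{iC_1} + \sum_{i \in C_2} r^2_{iC_2} + \frac{|C_1||C_2|}{|C_1| + |C_2|} r^2_{C_1C_2}$$

We deduce the claim by replacing this expression into Equation (2).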
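The identity obtained in this proof only relies on the centroid property and on the inner product, so it can be checked on arbitrary vectors. In the minimal sketch below, the vectors stand in for the $P^t_{i.}$ and the per-coordinate `weights` define a hypothetical weighted inner product; all values and helper names are illustrative assumptions.

```python
def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]

def sq_dist(x, y, weights):
    """Squared distance induced by a weighted inner product."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights))

def scatter(vectors, weights):
    """Sum of squared distances from each vector to the centroid."""
    c = centroid(vectors)
    return sum(sq_dist(v, c, weights) for v in vectors)

weights = [1.0, 0.5, 0.25]                       # hypothetical weights
C1 = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]          # stand-ins for vectors of C1
C2 = [[0.7, 0.1, 0.2], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]

# Left side: scatter of the merged community minus the two separate scatters.
lhs = scatter(C1 + C2, weights) - scatter(C1, weights) - scatter(C2, weights)
# Right side: |C1||C2| / (|C1| + |C2|) times the squared centroid distance.
rhs = (len(C1) * len(C2) / (len(C1) + len(C2))
       * sq_dist(centroid(C1), centroid(C2), weights))
```

The two sides agree up to rounding, which is what allows the merge cost to be computed from the two centroids alone.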
Proof of Theorem 4

We replace the four $\Delta\sigma$ of the theorem's equation by their values given by Theorem 3. We multiply each side by
$n(|C_1| + |C_2| + |C|)$ and use $|C_3| = |C_1| + |C_2|$, and obtain the equivalent equation:

$$(|C_1| + |C_2|) r^2_{C_3C} = |C_1| r^2_{C_1C} + |C_2| r^2_{C_2C} - \frac{|C_1||C_2|}{|C_1| + |C_2|} r^2_{C_1C_2}$$

Then we use the fact that $P^t_{C_3.}$ is the barycenter of $P^t_{C_1.}$ weighted by $|C_1|$ and of $P^t_{C_2.}$ weighted by $|C_2|$,
therefore:

$$|C_1| r^2_{C_1C} + |C_2| r^2_{C_2C} = (|C_1| + |C_2|) r^2_{C_3C} + |C_1| r^2_{C_1C_3} + |C_2| r^2_{C_2C_3}$$

We conclude using $|C_1| r^2_{C_1C_3} + |C_2| r^2_{C_2C_3} = \frac{|C_1||C_2|}{|C_1| + |C_2|} r^2_{C_1C_2}$.
B More experimental results



Influence of the length t on a hierarchically structured network. In this appendix we
study the influence of the length $t$ of the random walks. To do this we need graphs with a hierarchical
community structure. We generated graphs with $n = 256$ vertices divided into 8 small communities included
in 4 medium communities and then in 2 large communities, see Figure 4(a). We chose an internal edge
density in small communities $p_1 = 0.3$, and an edge density between small, medium and large communities
$p_2 = 0.1$, $p_3 = 0.75$ and $p_4 = 0.5$, respectively. We ran our algorithm for different $t$ and we computed the
ratio of vertices correctly identified for the three kinds of communities. The results (Figure 4(b)) show that
we get good performance for short random walks. We also notice that the range of $t$ that gives good results
depends on the size of the communities found. There is a relation between the size of the communities to
identify and the value of $t$ that we must choose. It seems that a good choice of $t$ must leave enough time for
random walks to reach all the vertices of a community but not enough time to reach all the vertices of the
graph. This is why a length close to the diameter of the communities to identify seems a relevant choice.
Moreover these tests show that our approach is able to identify community structures at different scales:
we clearly have three peaks on $\eta_k$ corresponding to the three sizes of communities (Figure 4(c)).




Figure 4: (a) Hierarchical community structure used for the test. Each of the 8 small communities has 32
vertices. (b) The three ratios (corresponding to the three community sizes) of vertices correctly identified
by our algorithm as a function of the random walk length $t$ used. (c) Evolution of $\eta_k$ (last 30 steps)
showing that we identify the three scales in the community structure.



Experiment on a real network (Internet map). We tested our algorithm on a map of the
Internet (provided by Magoni [IS]) that contains 12,929 routers and 52,844 physical links between them.
Each router belongs to a known Autonomous System (AS) and the aim of the experiment is to see if we
can retrieve them using our community detection algorithm, exploring the idea that they may correspond
to dense subgraphs.

The map has been established from intensive traceroute experiments and only covers a part of the
Internet: routers from 383 different AS are represented and are linked by 35,096 internal links and 17,748
links between different AS. However, due to the measurement method, some AS are poorly discovered: we
only see a small number of their routers and their internal links. This phenomenon implies that many small
AS (small on the map but not necessarily in reality) cannot be considered as communities (from our point
of view). For instance, 320 of the represented AS (corresponding to 5,186 routers) have a ratio of the number
of external edges to the number of internal edges larger than 1, and 237 AS (2,706 routers) have this ratio
larger than 2.

Our algorithm computed a community structure for $t = 5$ in 5 minutes (on a P4-M 2.2 GHz, 512 MB).
We looked at the partition with the best modularity ($Q = 0.73$), which contains 646 communities. For each
router, we computed the ratio of routers in its community that belong to the same AS. The mean of this
ratio over the routers is 52%, which shows that even in these bad conditions our algorithm is able to group
together a significant portion of the routers of an AS.
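The evaluation measure described above can be sketched as follows. This is one possible reading, under the assumption that a router counts itself among the routers of its community; the function name `mean_same_label_ratio` and the toy data are hypothetical.

```python
from collections import Counter

def mean_same_label_ratio(community_of, label_of):
    """Mean over vertices of the share of their community with the same label.

    community_of maps each vertex to its detected community, label_of maps
    each vertex to its reference group (here: the AS of a router).
    Assumption: each vertex is counted in its own community.
    """
    size = Counter(community_of.values())
    pair = Counter((community_of[v], label_of[v]) for v in community_of)
    ratios = [pair[(community_of[v], label_of[v])] / size[community_of[v]]
              for v in community_of]
    return sum(ratios) / len(ratios)

# Hypothetical toy data: three routers in community 0, one in community 1.
community_of = {"a": 0, "b": 0, "c": 0, "d": 1}
label_of = {"a": "AS1", "b": "AS1", "c": "AS2", "d": "AS2"}
score = mean_same_label_ratio(community_of, label_of)
```

On the toy data the per-router ratios are 2/3, 2/3, 1/3 and 1, so the mean is 2/3.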